Code from the "Training the tokenizer" part of the Quicktour
The code assumes wikitext has been extracted under the data directory
special_tokens are inserted into the vocabulary
They are not used at all during training (see the check below)
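A minimal check of the above (a sketch; it assumes tokenizer has already been trained with this trainer as in the code block below, and relies on the Quicktour's note that IDs follow the order of the special_tokens list):
code:special_tokens_check.py
>> from tokenizers.trainers import BpeTrainer
>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
>> # after tokenizer.train(files, trainer), the special tokens are in the vocabulary
>> tokenizer.token_to_id("[UNK]")
0
>> tokenizer.token_to_id("[MASK]")
4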
Add a pre-tokenizer
Without a pre-tokenizer that will split our inputs into words,
we could get an "it is" token since those two words often appear next to each other.
Using a pre-tokenizer will ensure no token is bigger than a word returned by the pre-tokenizer.
(Arguably this guarantees there are no word-bigram tokens)
Assign it to tokenizer.pre_tokenizer (which is initially None, i.e. no pre-tokenizer); see the sketch below
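What the Whitespace pre-tokenizer does to a string can be seen directly with pre_tokenize_str, which returns (word, offsets) pairs (a small sketch; the sample string is arbitrary):
code:whitespace_demo.py
>> from tokenizers.pre_tokenizers import Whitespace
>> Whitespace().pre_tokenize_str("Hello, world!")
[('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]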
Training (tokenizer.train)
Saving (tokenizer.save)
"To save the tokenizer in one file that contains all its configuration and vocabulary"
Loading (Tokenizer.from_file)
code:train_tokenizer_tour.py
>> from tokenizers import Tokenizer
>> from tokenizers.models import BPE
>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>> from tokenizers.trainers import BpeTrainer
>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
>> from tokenizers.pre_tokenizers import Whitespace
>> tokenizer.pre_tokenizer  # None before a pre-tokenizer is set
>> tokenizer.pre_tokenizer = Whitespace()
>> files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
>> tokenizer.train(files, trainer)
>> tokenizer.save("tokenizer-wiki.json")
>> tokenizer2 = Tokenizer.from_file("tokenizer-wiki.json")
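A quick sanity check that the reloaded tokenizer behaves the same as the original (a sketch; the sample sentence is arbitrary):
code:roundtrip_check.py
>> s = "Hello, y'all! How are you?"
>> tokenizer.encode(s).ids == tokenizer2.encode(s).ids
True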
code:keys of the saved file
$ jq 'keys' tokenizer-wiki.json
[
  "added_tokens",
  "decoder",
  "model",
  "normalizer",
  "padding",
  "post_processor",
  "pre_tokenizer",
  "truncation",
  "version"
]